Q($\lambda$) with Off-Policy Corrections
Abstract
We propose and analyze an alternate approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence in both policy evaluation and control, provided certain conditions are satisfied. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and formalize an underlying tradeoff in off-policy TD($\lambda$). We illustrate this theoretical relationship empirically on a continuous-state control task.
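The correction described in the abstract can be sketched as a forward-view λ-return in which each multi-step TD error is evaluated against the expected Q-value under the target policy, with no importance-sampling ratios. The following is a minimal tabular sketch under assumptions not stated in the abstract (function name, trajectory format, and the specific forward-view form are illustrative):

```python
import numpy as np

def q_lambda_return(Q, traj, target_pi, gamma=0.9, lam=0.8):
    """Sketch of an off-policy corrected lambda-return: off-policy rewards
    are corrected with the current Q-function (via expected TD errors under
    the target policy) rather than with importance-sampling ratios.

    Q:         array of shape (n_states, n_actions), current Q estimates.
    traj:      list of (state, action, reward, next_state) tuples collected
               under an arbitrary behavior policy.
    target_pi: array of shape (n_states, n_actions), target-policy
               action probabilities.
    """
    s0, a0 = traj[0][0], traj[0][1]
    G = Q[s0, a0]          # start from the current estimate
    coeff = 1.0            # accumulates (gamma * lambda)^n
    for (s, a, r, s_next) in traj:
        # TD error using the target policy's expected next-state value
        expected_next = np.dot(target_pi[s_next], Q[s_next])
        delta = r + gamma * expected_next - Q[s, a]
        G += coeff * delta
        coeff *= gamma * lam
    return G
```

With a zero Q-function the corrected return reduces to the plain discounted-by-(γλ) sum of rewards, which makes the role of the Q-based correction terms easy to see: they only contribute once the estimates are nonzero.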
Similar resources
The effect of the cooling time scale and gas non-ideality on shock waves
Due to the sudden compression of matter in some regions of compressible fluids, the density and temperature sharply increase, and shock waves can be produced. The cooling of the post-shock region and the non-idealness of the equation of state, $p=(k_B/\mu m_p)\rho T(1+b\rho) \equiv \mathcal{K}\rho T(1+\eta R)$, where $\mu m_p$ is the relative density of the post-shock gas and $R\equiv \rho_2 /...
Inverse Sturm-Liouville problems with transmission and spectral parameter boundary conditions
This paper deals with the boundary value problem involving the differential equation $\ell y:=-y''+qy=\lambda y$, subject to the eigenparameter-dependent boundary conditions along with the following discontinuity conditions: $y(d+0)=a\, y(d-0)$, $y'(d+0)=a\,y'(d-0)+b\, y(d-0)$. In this problem $q(x), d, a, b$ are real, $q\in L^2(0,\pi)$, $d\in(0,\pi)$, and $\lambda$ is a parameter independent of $x$. By defining a new...
Inverse Sturm-Liouville problem with discontinuity conditions
This paper deals with the boundary value problem involving the differential equation $\ell y:=-y''+qy=\lambda y$, subject to the standard boundary conditions along with the following discontinuity conditions at a point $a\in (0,\pi)$: $y(a+0)=a_1 y(a-0)$, $y'(a+0)=a_1^{-1}y'(a-0)+a_2 y(a-0)$, where $q(x), a_1, a_2$ are rea...
Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic
Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but at the cost of high variance, which often requires large batches, while TD-style methods, such as off-policy act...
Complete O(α) QED corrections to the process ep → eX in mixed variables
The complete set of O(α) QED corrections with soft photon exponentiation to the process ep → eX in mixed variables ($y = y_h$, $Q^2 = Q_l^2$) is calculated in the quark parton model, including the lepton-quark interference and the quarkonic corrections, which were unknown so far. The interference corrections amount to a few percent or less and become negligible at small x. The leading logarithmic terms p...
Publication date: 2016